Introduce resumable downloads with --resume-retries #12991
base: main
Conversation
- Added --resume-retries option to allow resuming incomplete downloads
- Setting --resume-retries=N allows pip to make N attempts to resume downloading, in case of dropped or timed out connections
- Each resume attempt uses the values specified for --retries and --timeout internally

Signed-off-by: gmargaritis <gmargaritis@protonmail.com>
Force-pushed 16fb735 to dbc6a64
I'm guessing the CI fails because of the new linter rules introduced in 102d818
Does this do rsync-style checksums? That would increase reliability.
Signed-off-by: gmargaritis <gmargaritis@protonmail.com>
Hey @notatallshaw 👋 Is there anything that I can do to move this one forward?
A pip maintainer needs to take up the task of reviewing it; as we're all volunteers, it's a matter of finding time. I think my main concern would be the behavior when interacting with index servers that behave badly, e.g. give the wrong content length (usually 0). Your description looks good to me, but I haven't had time to look over the code yet.
Yeah, I know how it goes, so no worries! If you need any clarifications or would like me to make changes, I'd be happy to help!
Any chance that it'll be merged soon?
I've had an initial cursory glance at this PR and it appears to be sufficiently high quality. I've also run the functionality locally (select a large wheel to download and then disconnect my WiFi midway through the download) and it has a good UX. My main concern, although this is a ship that has probably sailed, is it would be nice for pip not to have to directly handle HTTP intricacies and leave that to a separate library. I can't promise a full review or that other maintainers will agree, but I am adding it to the 25.1 milestone for it to be tracked.
The PR looks good, although I'm not an HTTP expert so I can't comment on details like status and header handling. Like @notatallshaw I wish we could leave this sort of detail to a 3rd party library, but that would be a major refactoring. Add this PR (along with cert handling, parallel downloads, etc.) to the list of reasons we should consider such a refactoring, but in the meantime I'm in favour of adding this.
There isn’t an “approve with conditions” button, but I approve this change on the basis that someone who understands http should check the header and status handling. |
Hopefully this gets added soon; downloading GBs of stuff over slow internet and then having to restart from the beginning is not an experience I would recommend.
@Ibrahima-prog I hear ya! This is on my radar to review. I haven't gotten around to it yet. And truthfully, I probably won't find the time until at least next Thursday. This will make it into the pip 25.1 release.
I've read through the code a couple of times and tried to educate myself on the relevant HTTP tags, as well as trying it against PyPI. Overall I'm happy with this PR, but I have left a few comments on edge cases that I would like you to address or provide thoughts on.
On the topic of edge cases, here's an example where a range request is possible on the index but the HEAD request to check returns a 405: astral-sh/uv#11379, I don't think this PR is affected by this behavior, but it's an interesting example of how edge casey all this behavior can be.
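To make that edge case concrete, here is a small illustrative sketch (hypothetical helper, not pip's actual code) of why a HEAD probe is only a best-effort signal for range support: a 405 on HEAD says nothing about whether a ranged GET would succeed.

```python
def head_suggests_range_support(status_code: int, headers: dict) -> bool:
    """Interpret a HEAD response when probing for range-request support.

    A 405 means HEAD itself is not allowed; it says nothing about whether
    a GET with a Range header would work. Treating it as "unsupported" is
    the conservative choice, but it misses servers like the one in the
    astral-sh/uv#11379 report, which reject HEAD yet honor Range on GET.
    """
    if status_code == 405:
        return False  # inconclusive: a ranged GET may still work
    return headers.get("Accept-Ranges", "").lower() == "bytes"
```

A more robust strategy (which is roughly what conditional range requests buy you) is to skip the probe entirely and simply issue the ranged GET, handling a 200 full-body response as "ranges unsupported".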
    except (ConnectionError, ReadTimeoutError):
        continue
    ...
    if total_length and bytes_received < total_length:
What do you think about changing the `<` to `!=`?

I have seen corporate proxies return completely different data, such as a page saying the download is not allowed with the full text of the internal policies related to network traffic.

This would change the semantics of the error, something like `DownloadError` instead of `IncompleteDownloadError`, with verbiage related to it possibly being incomplete or a blocked network request.
I'd prefer if we more strongly suggest that the download was incomplete. I agree that we should mention the possibility that the response is total nonsense (yay for enterprise proxies), but the error message should emphasize the more likely culprit of an incomplete download. Perhaps we could check the response `Content-Type`, and if it isn't what we're expecting, then we can assume the response is total nonsense?
Since this logic is within the scope of resumable downloads, I'd suggest leaving this out of scope for now, as it affects pip's overall download behavior, not just retries.

Changing `<` to `!=` would alter the semantics of the error, making it more about detecting completely different responses rather than just incomplete downloads. If we want to handle cases where proxies return unexpected content (like policy pages), that should be considered holistically across all downloads, not just resumable ones.
For now, the retry mechanism should continue treating incomplete downloads as the primary concern. If the connection is stable, pip won’t crash, and a broader discussion would be needed for verifying response integrity (e.g., checking Content-Type).
(This is a preliminary review consisting of feedback that immediately came to mind. I will need to review this again in more detail.)
Thank you so much for working on this! It's great to see the resumption of incomplete downloads make progress! I left some initial comments, but I have a larger question: why are the resume retries set to `0` by default?
I'd prefer if pip's behaviour emulated that of a browser, where it caches the incomplete download so the download can be resumed at some later point, but I realize that would significantly increase the complexity of the implementation (and it'd also introduce some friction as the download would have to be manually restarted).
However, defaulting to not resuming here results in poor UX IMO. Imagine I download half of a $very-large-wheel (e.g., PyTorch @ ~1 GB) and then the download is dropped. I get an `incomplete-download-error` explaining what happened. Good! I try again with `--resume-retries 5`. What's not so good is that pip will have to download all of said $very-large-wheel again.
If I'm on a slow or metered connection, I'd be frustrated that I have to download everything again. Doubly so if the connection failed at 90% or similar. In addition, it's not immediately clear how many resumption retries I should pass to pip. 1? 2?
Would it be possible to default to allowing a few (1-3) resume attempts? That way, if the download fails halfway through, the download will be given another shot. It may not be enough if the connection is so unstable that it requires a ton of resumes, but for one-off failures, it would still be a major improvement. As long as the messaging is clear, I don't think automatic resumes would be that annoying to the user.¹ I consider resumes the preferred option and opting out of resumption to be an exceptional (but still important to support!) case.
Anyway, thank you again for your patience! I also appreciate all of the tests (although I have only scanned through them very briefly). Despite the flaws and my critiques, this is a major step forward, giving users a fighting chance to download large distributions on unstable connections.
Footnotes

¹ Although if resumes are the default, perhaps we shouldn't allow the download to restart from zero (i.e., range requests are NOT supported) multiple times? Downloading the whole file numerous times over could be very slow and surprising (especially for users on metered connections) and thus be something they need to opt into... although that would make the default special. A default of `--resume-retries 3` would be treated differently from the user specifying `--resume-retries 3`. ↩
    return bytes_received
    ...
    def _attempt_resume(
Is it possible for `total_length` to be None and resumption to still function? AFAICT by reading the current logic, no. The annotation can be changed to `int` and the tests for `total_length` can be dropped in the function's body.
It is indeed not possible for the resumption to still function if `total_length` is `None`, but since `_attempt_resume` relies on `_get_http_response_size` among other things, we have to define it as `Optional[int]`.
Yeah, I see that now. The underlying "problem" is that we also allow restarting the download in `_attempt_resume`. See my review for more commentary on this. Specifically, I do think this area would benefit from additional refactoring to make it more obvious what's going on (and the preconditions being held), but I'm going to punt that for later.
@thk686 I'm very late to the party, but could you elaborate on how checksums come into play? AFAIK, indices don't serve the checksums of their distributions so there is no way pip could double check the download wasn't corrupted unless the checksums were given by the user. This PR uses conditional range requests (via the `If-Range` HTTP request header) which will avoid the issue of the file being changed on the server in-between requests.
I was thinking of over-the-wire corruption. In some cases it can be beneficial to checksum on both ends.
There has been some discussion around this in the past¹ and I’d pretty much prefer it. However, I think that it’s out of scope for this first step in implementing resumable downloads, considering the amount of work needed.
I initially set the default to `0` in order to keep backwards compatibility. We have two options:
Footnotes
Signed-off-by: gmargaritis <gmargaritis@protonmail.com> (cherry picked from commit f2e48c3f5885305369b88761ab74cd16a0869667)
Signed-off-by: gmargaritis <gmargaritis@protonmail.com> (cherry picked from commit 53ce184348de1af4937dc04de7a1aedbe4ede19a)
Signed-off-by: gmargaritis <gmargaritis@protonmail.com> (cherry picked from commit 1f8d7fe0b0a5c7b53719bd8713619f982c042dbf)
Signed-off-by: gmargaritis <gmargaritis@protonmail.com> (cherry picked from commit af6b7ac624ebc18035d2da217c4c1850a6850cd7)
Signed-off-by: gmargaritis <gmargaritis@protonmail.com> (cherry picked from commit 67e366aec42d913436159ca3bf877c46a0d5cd2c)
…_download Signed-off-by: gmargaritis <gmargaritis@protonmail.com>
Just so everyone is on the same page, I plan on re-reviewing this PR sometime this week. I'm working on prototyping some code style changes which I'll share soon. Beyond that, I'd like to review the other parts of the resuming UX. After that, I should be happy enough with this to merge it and let any other suggestions be handled at a later date.
Force-pushed 7e165af to c146e81
Hey, I've completed another review pass. In this review, I've taken a closer look at the code structure and the user-facing messaging.
Firstly, I pushed three commits to your branch:
- ff2ccd2 - I didn't like how `bytes_received` was being passed absolutely everywhere. Thus, I removed it as a parameter to `_write_chunks_to_file`. The method will only return how many bytes it has downloaded. It's on the calling method to keep track of how many bytes have been downloaded. I also made some other simplifications where it didn't impact readability.
- 2808134 - The point of the `exceptions.py` module is to separate the error printing logic from the business logic. For that reason, I moved all of the formatting logic inside the exception. Finally, I made some changes to the error itself:
  - If resumes are disabled, don't even bother saying that there are 0 attempts configured
  - If resumes are disabled, provide a more specific hint to enable resumes using `--resume-retries`
  - Don't repeat how many attempts are configured
  - "File" -> "URL"
  - Drop the "error" suffix from the error code, it's redundant
- c146e81 - The "Attempting to resume" warning was reworked to be easier for non-technical users to understand. Also, I replaced any mention of "retries" with "attempts", as "retries" IMO is confusing with the `--retries` flag
Please let me know if you have any issues with these changes.
Secondly, I have some larger design questions:
(As I mention at the end, these questions are not blocking. I want to discuss them, but no changes need to be made to get this PR landed.)
- How about we rename the `--resume-retries` flag to `--resume-attempts`? I know that doesn't follow the pattern set by `--retries`, but `--resume-retries` is IMO confusing because it sounds similar to `--retries` but configures something else entirely (in addition, the former is handled by urllib3, while we handle the latter).
- Should restarts be allowed while retrying a download? My concern is that a user will be okay with pip resuming a download (perhaps even 10s of times) but wouldn't be okay with restarting the download from scratch several times (e.g., they're on a metered connection). I had a semi-private discussion with @emmatyping about this. I've decided that this is fine for now, but I want to revisit this before we make resuming the default.
More broadly, this PR has been pretty hairy to review. That's unavoidable, of course, but I think it highlights that this feature could be confusing for end-users, too.
With this PR, there are several layers to our HTTP logic. urllib3 handles connection timeouts, read timeouts, and request retrying. These correspond to the `--retries` and `--timeout` flags. Now, there's also our own logic to retry an HTTP download, corresponding to `--resume-retries`. Note the emphasis on "retry". While reviewing this PR, I got confused by the fact that resuming and restarting are both supported. This PR generally refers to both features as "resuming", but IMO this is really "download retrying" (we resume from where we left off, but restart if resuming is impossible). This is confusing, and that's why I decided to switch to the word "attempt", as it better encapsulates both behaviours.
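To illustrate the "attempt" framing (resume where possible, restart from scratch otherwise), here is a deliberately simplified loop. All names are hypothetical and `fetch` stands in for the real request machinery; this is a sketch of the semantics, not pip's implementation:

```python
def download_with_attempts(fetch, total_length: int, resume_attempts: int,
                           can_resume: bool = True) -> bytes:
    """One initial try plus up to `resume_attempts` extra attempts.

    `fetch(offset)` returns whatever bytes it managed to read starting at
    `offset` before the connection dropped. When the server lacks range
    support (can_resume=False), every attempt restarts from offset 0,
    which is the behaviour questioned above for metered connections.
    """
    data = b""
    for _ in range(1 + resume_attempts):
        offset = len(data) if can_resume else 0
        chunk = fetch(offset)
        data = data + chunk if can_resume else chunk
        if len(data) >= total_length:
            return data
    raise IOError(f"incomplete download: got {len(data)} of {total_length} bytes")
```

Written this way, it is clear that "resume" and "restart" are the same attempt loop differing only in the starting offset, which is what makes "attempts" the more accurate word.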
Realistically, this nuance is going to be hard to get across¹, but I just don't want our users to have the wrong impression of the functionality. This PR doesn't add documentation. That's fine for now, but we should add some docs before making this the default.
Anyway, that's my review. I still haven't reviewed the tests, but that's not a blocking concern.
I want to thank you again for the PR! I will admit that I've been more nit-picky than usual. That's for two reasons: A) this is the first large pip PR I've reviewed, and B) this is complicated and I want to get this right. The latter is doubly important as it's hard to change things in pip down the line. Fortunately, backwards compatibility isn't really a huge problem with download resuming, but it's still something to consider.
On that note, I do want to get this in the pip 25.1 release, so barring any major objections, I will merge this in ~probably a week or two. I'm in favour of the PR and it's good enough to land as-is. We'll get feedback once pip 25.1 is out and we can make more changes based on that feedback as needed.
Footnotes
¹ I also recognize that as a maintainer/reviewer, I am more aware of the nuance and subtlety involved here than any regular pip user. It's likely that is contributing to the confusion, and the average user won't be as confused as they won't even know enough to be confused 🙂 ↩
Oh right, here are some screenshots of the new messaging:

[screenshot] (It says "Resuming download of" here. I'd made this adjustment earlier, but I reverted the change in my last commit)

[screenshot] The urllib3 request retrying warnings are very noisy here, but this is probably the less common scenario for retrying. I'm disconnecting the Internet entirely to trigger the retry logic. In practice, I'd imagine the download would fail, but the request/connection can be easily restored (so urllib3 won't complain). Either way, this is a pre-existing issue and should be dealt with separately.
Resolves #4796

Introduced the `--resume-retries` option in order to allow resuming incomplete downloads in case of dropped or timed out connections.

This option additionally uses the values specified for `--retries` and `--timeout` for each resume attempt, since they are passed in the session.

Used `0` as the default in order to keep backwards compatibility.

This PR is based on #11180